Abstract


The Link Probability Model (LPM) can be used as an alternative to Exponential Random
Graph Models (ERGM) to simulate network data. The LPM characterizes networks
in terms of link probabilities based on historical frequencies. In this paper, the LPM is
presented and then compared and contrasted with the ERGM. The relative utility of the two
approaches is examined by applying both to four longitudinal data sets. The relative
strengths and weaknesses of the two approaches in terms of data requirements, scalability,
and assumptions are described.


Introduction


Social networks often exhibit stochastic behavior. For example, an agent in a network might
communicate with a friend several times during a given day and not at all during another day. In
this example, the underlying relationship remains the same; however, the observed network ties
fluctuate. This is an intuitive example; however, the accuracy of observed network data has been
well documented in the literature (Killworth et al., 1976, 1979; Bernard et al., 1977, 1980, 1982;
Krackhardt, 1990; Kashy and Kenny, 1990; Wasserman and Faust, 1994). Furthermore, it is
possible that the underlying relationships in a social network may change (Carley, 1991; Doreian
and Stokman, 1997; Snijders, 2007). This relatively common behavior will also cause
fluctuations in observed network data. Therefore, statistical models of social networks are
necessary for any kind of meaningful inference on network data.

A necessary prerequisite for statistical inference on social networks is an underlying
probability structure for the presence of links in the network. Detecting changes over time,
comparing multiple networks, and evaluating a wide range of potential hypotheses all depend upon
a method to estimate the probability of links occurring in an observed network. Several
statistical models have been proposed. The p* model was introduced by Frank and Strauss
(1986). This model describes the distribution of a Markov random graph. Many others have
contributed to developing this family of models (Strauss and Ikeda, 1990; Wasserman and
Pattison, 1996; Anderson et al., 1999; Wasserman and Faust, 1994), especially in the area of
parameter estimation. A common approach to describing the link probability is the Exponential
Random Graph Model (ERGM) (Krackhardt, 1998; Handcock, 2002, 2003; Hunter, 2006;
Goodreau, 2007; Robins et al., 2007; Hunter et al., 2008). The ERGM is based on a regression
on structural variables in the network that may explain the probability of links occurring in the
network. Several authors have used the ERGM to simulate many instances of a given network and then
estimate statistical properties of various network measures (Handcock et al., 2006, 2007, 2008;
Handcock, 2008). I introduce an alternative approach, the Link Probability Model (LPM),
that uses the historical presence of links to estimate the link probability. I demonstrate both
simulation approaches on a range of empirical data and show that, for a limited number of
longitudinal data sets, the LPM provides a better fit to the data than the ERGM.

The ERGM is a family of statistical models that describe the probability of a link being
present between two nodes and is a common statistical model for social network analysis. The
models are based on logistic regression, where model terms are usually structural variables in the
network. The model is used to explore statistically significant properties of networks. The
ERGM notation is also flexible, allowing it to represent a wide range of network variables.
Unfortunately, many ERGMs are degenerate, meaning that the observed data might be highly
improbable given the model (Handcock, 2002, 2003). The ERGM is not typically used for
over-time network analysis; however, Mark Handcock presented an application of the ERGM for
simulating networks at the 28th Sunbelt Conference (2008).

The Link Probability Model (LPM) is not a statistical model, but rather a matrix of
probabilities of a link being present between ordered pairs of nodes. The LPM is estimated from
longitudinal networks based on the frequency of links being present over time. The LPM avoids
issues of model degeneracy because the model is not dependent upon highly correlated terms and
there are more data points than parameter estimates. The LPM is particularly useful for our
application, because we are only interested in modeling over-time data.

First, I briefly review the ERGM. Then the LPM is described and presented as an
alternative model to the ERGM. Then the LPM and ERGM are both used to model four data
sets: the Sampson (1969) Monastery data, the Newcomb (1961) Fraternity data, and two
sets of data from Fort Leavenworth (Graham, 2005; Bailer et al., 2008). These are four
interesting data sets because they all have a temporal component and have been well documented
in the literature. The fit of each model to the data is compared. I find that the ERGM is
degenerate for the Fort Leavenworth data and that the LPM provides a better fit in the other two
data sets under certain conditions. I conclude by discussing the strengths and limitations of the LPM
and its general usefulness to network analysts.


Exponential  Random  Graph  Model 


The ERGM is used in social network analysis as a statistical model that enables an
analyst to conduct inference on dependent relational data (Goodreau, 2007; Robins et al., 2007).
The ERGM is therefore less restrictive than the Holland and Leinhardt (1981) $p_1$ model that
assumed  dyadic  independence.  In  many  social  network  applications  the  relationship  between 
two  individuals  depends  on  relationships  between  the  individual  and  others  in  the  network; 
cognitive  limits  on  the  number  of  relationships  that  can  be  maintained;  similarity  between 
individuals;  and  more.  The  ERGM  framework  for  relaxing  the  dyadic  independence  assumption 
is  thus  essential  for  accurate  inference  in  many  data  sets. 

Exponential random graph models (ERGM) have been studied a great deal in the
literature as a model for the probability of links occurring in a social network. The ERGM was
first proposed as a very general model by Frank and Strauss (1986). The ERGM can thus be used
to model a wide range of explanatory variables. The basic ERGM is given by

$$P(Y = y) \propto \exp\{\theta_1 g_1(y) + \theta_2 g_2(y) + \dots + \theta_p g_p(y)\} \qquad (1)$$

where $Y$ is a random graph with realization $y$, the $\theta$'s are model coefficients, and each
$g(y)$ is a covariate or term in the model. Covariate terms are general and can represent many
features of a graph. These terms are often structural properties of the graph such as the number
of links, dyadic relations, and transitive properties, among others.
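
To make the terms in Equation (1) concrete, the following minimal sketch computes two common statistics for a small directed graph, the number of links and the number of mutual dyads, and combines them with coefficients. The function name `ergm_log_weight` and the coefficient values are illustrative assumptions and are not taken from any fitted model. Only the weighted sum of terms is computed; the normalizing constant over all possible graphs is not.

```python
import numpy as np

def ergm_log_weight(y, theta_edges, theta_mutual):
    """Weighted sum theta_1*g_1(y) + theta_2*g_2(y) for a directed 0/1 graph.

    Assumes y is a square adjacency matrix with a zero diagonal.
    """
    y = np.asarray(y)
    edges = int(y.sum())                  # g_1(y): number of directed links
    mutual = int((y * y.T).sum()) // 2    # g_2(y): number of reciprocated dyads
    return theta_edges * edges + theta_mutual * mutual

# Hypothetical 3-node graph: links 0->1, 1->0, 1->2 (one mutual dyad).
y = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]])
print(ergm_log_weight(y, theta_edges=-1.0, theta_mutual=2.0))
```

A negative links coefficient with a positive mutual coefficient, as in this toy call, favors sparse graphs in which the links that do exist tend to be reciprocated.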

Estimating  ERGM  terms  and  parameters  can  be  computationally  challenging  in  large 
networks  (Snijders,  2002;  Pattison  and  Robins,  2002).  Markov  chain  Monte  Carlo  estimation  of 
ERGM has been used to fit these models to data (Goodreau, 2007; Robins et al., 2007;
Handcock, 2002, 2003; Snijders, 2002; Pattison and Robins, 2002). The Markov dependence in
these models leads to problems of degeneracy, which are discussed in detail by Handcock (2002,
2003). Essentially, model degeneracy occurs when the observed data are almost impossible under
the specified model. This often occurs when explanatory terms are highly correlated and there are
insufficient data to construct an appropriate model. Many of the terms used in ERGM are
correlated, and it is difficult to define enough terms to exclude networks that do not represent the
data but spuriously satisfy the ERGM terms. Several advances in ERGM have been
proposed, including curved exponential family models (Hunter and Handcock, 2006) and
neighborhood models (Pattison and Robins, 2004). However, these advances have not
completely removed issues of model degeneracy.


Link  Probability  Model  Formulation 


The LPM framework for viewing the probability space of a social network avoids issues
of model degeneracy while preserving flexibility for modeling dyadic relationships. It provides
researchers with an improved means to understand the probability space of the network, under
certain conditions. The LPM is a square matrix where the rows and columns correspond to the
nodes in a social network. The entries are the link probabilities of the directed link from the row
node to the column node. This is not to be confused with an adjacency matrix, where the entries
are either zero or some number representing the strength of a relationship between nodes. The
link probability is a number between 0 and 1 and determines the likelihood of a link being
present in an observed adjacency matrix.

The link probabilities can be derived from empirical data in several ways. Given network
data collected over multiple time periods on a group of subjects, the link probabilities can be
estimated by the proportion of time periods in which a link occurs in each cell, $a_{ij}$, of the
adjacency matrix. In the case of communication networks, statistical distributions can be fit to
the time between messages for each potential link in the network. For a specified period of time,
$t$, the link probability $p_{ij}$ for each pair of entities $i$ and $j$ can be found. Let $x_{ij}$ be the
time between messages in a communication network. The probability density function for any $x$
can then be defined as $f_{ij}(x \mid \theta_{ij})$, where $\theta_{ij}$ is the set of parameters for the
density function. Then the probability of a link occurring within some time period $t$ is the
probability that $x < t$, which can be expressed as

$$p_{ij} = \int_0^t f_{ij}(x \mid \theta_{ij})\, dx \qquad (2)$$
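
As a minimal sketch of Equation (2), suppose the time between messages on one potential link is modeled as exponential. This distributional choice, the function name `link_probability_exponential`, and the sample gaps are assumptions made for illustration; the text does not prescribe a particular density. The rate is fit by maximum likelihood (the reciprocal of the sample mean), and the integral in Equation (2) then has the closed form $1 - e^{-\lambda t}$.

```python
import math

def link_probability_exponential(gaps, t):
    """P(x < t) when inter-message times are exponential, fit by maximum likelihood."""
    rate = len(gaps) / sum(gaps)        # MLE of the exponential rate: 1 / sample mean
    return 1.0 - math.exp(-rate * t)    # integral of rate*exp(-rate*x) over [0, t]

# Hypothetical gaps (in hours) between successive messages from i to j.
gaps_hours = [2.0, 5.0, 1.0, 4.0]
p_ij = link_probability_exponential(gaps_hours, t=3.0)
print(round(p_ij, 4))
```

Other densities could be substituted by replacing the closed form with the corresponding cumulative distribution function evaluated at `t`.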

In practice, the function $f_{ij}(x \mid \theta_{ij})$ must be estimated from empirical data
collected on the group being studied, using techniques such as maximum likelihood estimation.
It may be desirable to construct a network based on a restriction such as, “two emails within a
time period demonstrate a relationship, but one does not.” In this case, it is necessary to compose
a function of random variables. If $h_{ij}(x \mid t, \theta_{ij})$ represents the probability density
function of time between two sets of two emails and $f_{ij}(x \mid \theta_{ij})$ represents the
probability density function of time between one set of two emails, then the following is true
under certain assumptions:

It is possible to generalize this idea; if $h_{ij}(x \mid t, \theta_{ij})$ is the probability that $x$ or more
communications occur within time $t$, then the following is true:
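
As a hedged illustration of composing a function of random variables in this setting, and not a reproduction of the relations referenced above, assume that successive inter-message times $X_1, X_2, \dots$ on a link are independent and share the density $f_{ij}(x \mid \theta_{ij})$. The symbols $S_k$ and $f_{ij}^{(k)}$ are introduced here for illustration only. Under that assumption, the total time spanned by $k$ successive gaps, $S_k = X_1 + \dots + X_k$, has a density given by repeated convolution, and the probability that those $k$ gaps fit within a window of length $t$ follows by integration:

$$f_{ij}^{(1)}(x) = f_{ij}(x \mid \theta_{ij}), \qquad f_{ij}^{(k)}(x) = \int_0^x f_{ij}^{(k-1)}(u)\, f_{ij}(x - u \mid \theta_{ij})\, du, \qquad P(S_k < t) = \int_0^t f_{ij}^{(k)}(x)\, dx$$

For $k = 1$ this reduces to Equation (2).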

The LPM is an important improvement over some traditional models. Individuals in a
social network are not connected to other individuals with uniform random probability. The
probability structure is much more complex. Intuitively, there are some people with whom a person
will communicate or be connected more closely than others. In a study of email
communication conducted at the U.S. Military Academy (McCulloh et al., 2007), one subject
emailed  his  wife  more  than  ten  times  per  day  on  average,  while  other  people  that  he  worked  with 
received  an  email  from  him  once  or  twice  per  month.  For  this  reason,  real-world  networks  tend 
to  have  clusters  or  cliques  of  nodes  that  are  more  closely  related  than  others  (Newman,  2003; 
Topper  and  Carley,  1999;  Carley,  1996).  This  can  be  simulated  by  varying  the  probabilities  that 
certain  nodes  will  communicate.  In  this  way,  stochastic  behavior  in  dynamic  social  networks  can 
realistically  be  simulated. 

The  LPM  is  a  desirable  model  due  to  its  ability  to  accurately  model  empirical  data  and  its 
ability  to  avoid  degeneracy.  The  accuracy  of  the  LPM  will  be  discussed  in  the  Results  section. 
The  LPM  can  avoid  issues  of  model  degeneracy  because  the  only  parameters  for  the  model  are 
the  link  probabilities.  As  long  as  there  are  at  least  two  time  periods  for  estimating  parameters, 
there  are  more  data  points  than  there  are  parameters.  Each  link  is  treated  independently  of  other 
links  in  the  model;  therefore,  none  of  the  terms  are  correlated.  The  naive  assumption  of 
independence  between  links  is  corrected  by  the  historic  presence  of  links  over  time.  Intuitively, 
links  have  some  dependence.  For  example,  if  an  individual  chooses  to  communicate  with 
another,  the  likelihood  of  that  person  reciprocating  the  communication  increases.  If  we  assume  a 
dynamic  equilibrium  in  the  underlying  relationships  of  individuals  in  the  network,  these  patterns 
of  dependent  communication  will  be  apparent  over  time.  If  node  i  has  a  high  link  probability 
with  node  j,  it  may  be  likely  that  node  j  has  a  reciprocal  high  link  probability  with  node  i.  It  is 
not  necessary  to  directly  account  for  this  in  the  model.  If  the  relationship  is  true,  there  will  be  a 
high  expected  occurrence  of  i  to  j  and  j  to  i  links  in  the  networks  over  time.  The  LPM  will  model 
these links with high link probability due to their over-time frequency, and not directly from their
structural dependency. In this way, the LPM can never be overspecified, have high variance
inflation, or be degenerate. Thus, the LPM may provide an attractive alternative to the ERGM
for modeling longitudinal networks, particularly those for which the ERGM is degenerate.
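
The frequency-based estimate described in this section can be sketched as follows. The function name `estimate_lpm` and the three toy networks are hypothetical; the only step taken from the text is that each link probability is the proportion of time periods in which the corresponding directed link is present.

```python
import numpy as np

def estimate_lpm(networks):
    """Cell-wise proportion of time periods in which each directed link is present."""
    stack = np.asarray(networks, dtype=float)   # shape: (periods, n, n), entries 0 or 1
    return stack.mean(axis=0)                   # one probability per ordered pair (i, j)

# Three hypothetical observations of a 3-node directed network.
t1 = [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
t2 = [[0, 1, 1], [1, 0, 0], [0, 0, 0]]
t3 = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
lpm = estimate_lpm([t1, t2, t3])
print(lpm)
```

Here the link from node 0 to node 1 appears in every period and so receives probability 1, while the link from node 1 to node 0 appears in two of three periods. Reciprocity is never modeled explicitly, yet both directions of that dyad end up with high probability because both were frequently observed.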

Data  for  Comparison 


Four data sets are used to demonstrate the efficacy of the LPM. The first and second are
longitudinal data sets that are well established in the SNA literature, namely the Sampson (1969)
Monastery data and the Newcomb (1961) Fraternity data. The third and fourth data sets are
larger in size. For the reader's convenience, Table 1 summarizes the similarities and differences
among the data sets. All four are explained in more detail below.

Table 1. Data Summary.

Name of data set                 Monastery          Fraternity         Leavenworth '05         Leavenworth '07
Author                           Sampson            Newcomb            Graham                  Schreiber
Number of nodes                  18                 17                 156                     68
No. of time periods              3                  15                 9                       8
Method of collection             Observation        Survey             Survey & Observation    Survey
Link weight                      Dichotomous        Weighted           Dichotomous             Dichotomous
Link relationship                Interpersonal      Preference         Self-reported           Self-reported
                                 relationship       ranking            communication           communication
Change in density                0.17974-0.18301    0.50000-0.50000    0.01431-0.02906         0.04473-0.04628
Change in average betweenness    0.05556-0.05556    0.33574-0.41176    0.00880-0.00994         0.02009-0.01909
Change in average closeness      0.40158-0.02485    0.66510-0.39859    0.03759-0.05172         0.05739-0.08186
Change in average eigenvector    0.23428-0.23247    0.79907-0.74891    0.23591-0.22963         0.2125-0.22243
  centrality

The first data set was collected in a monastery by Samuel F. Sampson (1969). The
participants included 18 monks, and data was recorded on their interpersonal relationships. This
is a directed network, where relationships are not necessarily reciprocal. Data was collected over
three time periods, representing the time in which a new cohort joined the monastery.

The second data set was collected by Theodore Newcomb (1961) at the University of
Michigan. The participants included 17 incoming transfer students, with no prior acquaintance,
who were housed together in fraternity housing. The participants were asked to rank their
preference of individuals in the house from 1 to 16, where 1 is their first choice. Data was
collected each week for 15 weeks, except for week number nine. The relational data recorded
between agents were ranks. Both the ERGM and LPM require dichotomous networks to
construct a model. I chose to adopt the binarization scheme proposed by David Krackhardt
(1998). He dichotomized the network data by assigning a link to preference ratings of 1-8 and
having no link for ratings of 9-16. Krackhardt also fit an ERGM to the Newcomb Fraternity data,
which will be used for comparison with the LPM.
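
The dichotomization rule above can be sketched in a few lines. The function name `dichotomize_ranks`, the `cutoff` parameter, and the randomly generated rank matrix are illustrative assumptions standing in for the actual Newcomb data; only the rule itself (ranks 1-8 become links, ranks 9-16 do not) comes from the text.

```python
import numpy as np

def dichotomize_ranks(ranks, cutoff=8):
    """Return a 0/1 adjacency matrix: link where 1 <= rank <= cutoff, else no link."""
    ranks = np.asarray(ranks)
    links = ((ranks >= 1) & (ranks <= cutoff)).astype(int)
    np.fill_diagonal(links, 0)      # no self-links
    return links

# Hypothetical rank matrix: each of 17 people ranks the other 16 from 1 to 16.
rng = np.random.default_rng(0)
n = 17
ranks = np.zeros((n, n), dtype=int)
for i in range(n):
    others = [j for j in range(n) if j != i]
    ranks[i, others] = rng.permutation(np.arange(1, n))
links = dichotomize_ranks(ranks)
print(links.sum(axis=1))            # out-degree of every node
```

Because every participant ranks all 16 others, this rule gives each node exactly eight outgoing links, so the density is fixed at 0.5 in every week, which is consistent with the constant density reported for the Fraternity data in Table 1.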

The third data set was collected from an Army war-fighting simulation at Fort
Leavenworth, Kansas in 2005, by Craig Schreiber and Lieutenant Colonel John Graham. The
participants were mid-career U.S. Army officers taking part in a brigade-level staff training
exercise. This data set contains 156 individual agents that were monitored over the course of
four and a half days. Data consists of communication ties between individuals as measured from
self-reported communications surveys. Surveys were completed at the end of each morning and
at the end of the day before the officers went home. Therefore, there are nine longitudinal time
periods.

The fourth data set was also collected from an Army war-fighting simulation at Fort
Leavenworth, Kansas by Craig Schreiber, this time in April 2007. There were 68 participants in
this data set, who served as staff members in the headquarters of the brigade conducting a
simulated training exercise. The data contains the communication between agents in the network,
which was collected through self-reported communications surveys. Data was collected over a
period of four days, twice per day. Thus, there were eight time periods.


Method  of  Comparison 


The ERGM and LPM are investigated for their strengths and weaknesses in modeling
longitudinal data. For the Sampson (1969) Monastery data, I use the ERGM that was fit to the
data by Hunter et al. (2008). The Akaike Information Criterion (AIC) is 302.61 and the Bayesian
Information Criterion (BIC) is 436.65. The Hunter et al. (2008) ERGM of the Sampson (1969) data
was chosen for this study based on its more favorable AIC and BIC compared to other models
found in the literature. I feel that this model is therefore an appropriate benchmark for
comparison with the LPM. An ERGM is also fit to the Newcomb (1961) Fraternity data. Again,
I have chosen an ERGM accepted in the literature, this time the model proposed by Krackhardt
(1998). An LPM is fit to both the Sampson and Newcomb data sets. Monte Carlo simulation is
used to generate instances of the Sampson Monastery social network and the Newcomb
Fraternity social network under the ERGM and LPM. In addition, an LPM is also fit to the two
Fort Leavenworth data sets (Graham, 2005; Bailer et al., 2008). For the two Fort Leavenworth
data sets, the ERGM was degenerate. The ERGMs were not degenerate for the Sampson or
Newcomb data sets. The LPM is successfully used to model all data sets.

A  distance  measure  is  required  to  compare  the  similarity  between  the  dichotomous 
networks generated using the ERGM, the LPM, and the empirical data. The Hamming distance
(Hamming, 1950) counts the cells in which two dichotomous networks differ. If the data were weighted networks
and  the  models  generated  weighted  networks  as  well,  then  a  Euclidean  distance  would  be 
appropriate.  The  quadratic  assignment  procedure  (QAP)  (Krackhardt,  1987b)  could  be  used  to 
compare  the  correlation  between  networks;  however,  we  focus  on  network  distance,  because  we 
intend  to  demonstrate  that  the  LPM  can  generate  simulated  models  that  are  very  similar  to  the 
original  networks  in  terms  of  actual  distance  and  not  simply  a  structural  isomorphism. 
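
A minimal sketch of the Hamming distance between two dichotomous networks follows, assuming both are stored as 0/1 adjacency matrices over the same ordered set of nodes. The function name `hamming_distance` and the toy matrices are illustrative.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of cells in which two same-shaped 0/1 adjacency matrices differ."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("networks must have the same set of nodes")
    return int(np.sum(a != b))

observed = [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
simulated = [[0, 1, 1], [1, 0, 0], [0, 0, 0]]
print(hamming_distance(observed, simulated))   # cells (0,2) and (2,1) differ
```

Because the comparison is cell by cell, two networks that are structurally identical but have their node labels permuted can still be far apart under this measure, which is the sense in which it captures actual distance rather than structural isomorphism.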

The ERGM and LPM are evaluated on how well they model empirical data using a t-test.
I illustrate the method with the Sampson Monastery data. Let the three networks in the
Monastery data be labeled $N_1$, $N_2$, and $N_3$. An ERGM is used to simulate networks, which are
labeled $E_1, E_2, E_3, \dots, E_{100,000}$. The LPM is also used to simulate networks, which are
labeled $L_1, L_2, L_3, \dots, L_{100,000}$. The Hamming distances are calculated between each
empirical data set and every simulated ERGM network, and I use the following notation:

$\mathrm{Dist}_{\mathrm{ERGM},1,1} = \mathrm{Hamming}(N_1, E_1)$
$\mathrm{Dist}_{\mathrm{ERGM},1,2} = \mathrm{Hamming}(N_1, E_2)$
...
$\mathrm{Dist}_{\mathrm{ERGM},i,j} = \mathrm{Hamming}(N_i, E_j)$
...
$\mathrm{Dist}_{\mathrm{ERGM},3,100000} = \mathrm{Hamming}(N_3, E_{100000})$.

The Hamming distances are also calculated between each empirical data set and every simulated
LPM network, with notation given by

$\mathrm{Dist}_{\mathrm{LPM},i,j} = \mathrm{Hamming}(N_i, L_j)$.

The Hamming distances are calculated between each empirical data set and every other empirical
data set, with notation given by

$\mathrm{Dist}_{\mathrm{empirical},i,j} = \mathrm{Hamming}(N_i, N_j)$, where $i \neq j$.

This last set of Hamming distances is a measure of noise or observation error inherent in the
data.


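The ERGM and LPM are compared using a two-sample t-test between the Hamming
distances from the empirical network, $N_i$, to all of the simulated networks from the ERGM and
from the LPM. The test statistic is given by

$$T_i = \frac{\mu_{\mathrm{ERGM},i} - \mu_{\mathrm{LPM},i}}{S_{p,i}\sqrt{2/100{,}000}}$$

where

$$\mu_{\mathrm{ERGM},i} = \frac{1}{100{,}000}\sum_{j=1}^{100{,}000} \mathrm{Dist}_{\mathrm{ERGM},i,j}, \qquad \mu_{\mathrm{LPM},i} = \frac{1}{100{,}000}\sum_{j=1}^{100{,}000} \mathrm{Dist}_{\mathrm{LPM},i,j}$$

and $S_{p,i}$ is the pooled standard deviation between the ERGM and LPM Hamming distances
(Montgomery, 1991). This is repeated for each time period, $i$.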
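
The comparison above can be sketched with a standard pooled two-sample t statistic. The function name `pooled_t_statistic` and the toy distance samples are illustrative assumptions; with equal sample sizes $n$, the factor $\sqrt{1/n_1 + 1/n_2}$ in the code reduces to $\sqrt{2/n}$.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(x, y):
    """Two-sample t statistic with a pooled standard deviation (equal variances assumed)."""
    n1, n2 = len(x), len(y)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    s_p = math.sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2))
    return (mean(x) - mean(y)) / (s_p * math.sqrt(1.0 / n1 + 1.0 / n2))

# Hypothetical Hamming distances from one empirical network to simulated networks.
dist_ergm = [10.0, 12.0, 14.0]
dist_lpm = [4.0, 6.0, 8.0]
t_i = pooled_t_statistic(dist_ergm, dist_lpm)
print(round(t_i, 3))
```

A positive value indicates that the first sample (here the ERGM distances) lies farther from the empirical network on average than the second.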

Results 


An ERGM was fit to the Sampson (1969) Monastery data according to the model
specification laid out by Hunter et al. (2008). Four model terms were used: links, sender,
receiver, and mutual. A summary of the model fit is shown in Table 2.

Table 2. Fit Summary for Sampson ERGM.

Model Parameter    Coefficient    Standard Error    MCMC S.E.    p-value
Links              -2.5131        0.3361            0.005        0.0000
sender2            -0.7356        0.6854            0.015        0.2842
sender3            -0.2146        0.7274            0.017        0.7682
. . . output edited for length . . .
receiver17         -1.2015        0.8191            0.018        0.1436
receiver18         -1.0562        0.7193            0.015        0.1432
Mutual              3.6816        0.6731            0.011        0.0000

The Hamming distance from each of the three empirical data sets to each of the ERGM
simulated networks and to each of the LPM simulated networks was calculated. The mean and
standard deviation of these Hamming distances are displayed in Table 3. A two-sample t-test for
each time period shows that the networks simulated using the LPM have a smaller average
Hamming distance to the empirical data sets than the networks simulated using the ERGM. This
indicates that the LPM models the Sampson data more accurately than the ERGM.


Table 3. Sampson Data Hamming Distances and T-test for ERGM and LPM.

Time      μ_ERGM,i (ERGM       Standard     μ_LPM,i (LPM         Standard     T_i         p-value
Period    Hamming Distance)    Deviation    Hamming Distance)    Deviation    (t-test)
1         98.70                5.6970       27.67                3.5922       39.43       0.0006
2         99.10                6.2263       24.99                3.5935       37.64       0.0007
3         103.70               6.2902       24.66                3.5945       39.74       0.0006

The Newcomb (1961) Fraternity data was also fit with an ERGM. Three model terms
were used: mutual, Simmelian ties, and balance. A summary of the model fit is shown in Table
4. The AIC is 308.93 and the BIC is 319.75, which are more favorable than similar variations of
the ERGM.


Table 4. Fit Summary for Newcomb ERGM.

Model Parameter    Coefficient    Standard Error    MCMC S.E.    p-value
Mutual             -1.5745        0.2304            0.0070       0.0000
Simmelian Ties      0.6581        0.0006            0.0001       0.0000
Balance             0.2333        0.0364            0.0010       0.0000

The Hamming distances from each of the fourteen empirical data sets to each of the
ERGM simulated networks and each of the LPM simulated networks were calculated. The mean
and standard deviation of these Hamming distances are displayed in Table 5. A two-sample
t-test for each empirical data set shows that the networks simulated using the LPM have a
smaller average Hamming distance to the empirical data sets than the networks simulated using
the ERGM. This indicates that the LPM models the Newcomb Fraternity data more accurately
than the ERGM.


Table 5. Newcomb Data Hamming Distances and T-test for ERGM and LPM.

Time      μ_ERGM,i (ERGM       Standard     μ_LPM,i (LPM         Standard     T_i         p-value
Period    Hamming Distance)    Deviation    Hamming Distance)    Deviation    (t-test)
1         139.7                8.3938       91.9                 5.1913       18.0147     0.0353
2         138.9                8.1847       75.1                 5.2128       24.6573     0.0258
3         137.3                8.2872       48.3                 5.2226       33.9732     0.0187
4         135.5                9.3363       49.7                 5.2340       29.0460     0.0219
5         134.1                8.9870       50.1                 5.2319       29.5558     0.0215
6         136.3                8.5251       45.5                 5.2440       33.6983     0.0189
7         133.9                9.0609       47.3                 5.2397       30.2202     0.0211
8         134.1                7.2946       51.9                 5.2591       35.6377     0.0179
10        133.7                5.1865       64.2                 5.2223       42.3990     0.0000
11        132.7                6.0562       53.4                 5.2074       41.4119     0.0006
12        136.3                8.4466       51.1                 5.2147       31.8930     0.0200
13        134.9                9.0117       46.6                 5.2311       30.9989     0.0205
14        133.9                5.4457       46.1                 5.2230       50.9574     0.0000
15        133.1                5.7242       47.2                 5.2378       47.4518     0.0004

The LPM is further investigated using the Fort Leavenworth data. ERGMs with only a
single term were found to be degenerate for several common parameter choices; therefore, they
are not included in the analysis of this section. For both of the Fort Leavenworth data sets, the
Hamming distance between the simulated LPM networks and each empirical network,
$\mathrm{Dist}_{\mathrm{LPM},i,j} = \mathrm{Hamming}(N_i, L_j)$, was compared to the Hamming distance between
each empirical network and the other empirical networks within the data set,
$\mathrm{Dist}_{\mathrm{empirical},i,j} = \mathrm{Hamming}(N_i, N_j)$, where $i \neq j$.
Two-sample t-tests were used to determine whether there was a significant difference in mean
Hamming distance between the empirical networks and the LPM. The t-tests were
adjusted for heteroscedasticity and unequal sample sizes. Table 6 displays the Hamming
distances and the results of the two-sample t-tests for the 2005 Fort Leavenworth data, and Table
7 displays this information for the 2007 Fort Leavenworth data. In all cases the Hamming
distance is less for the LPM. The low p-values show a statistically significant difference in mean
Hamming distance of the empirical to empirical comparison versus the LPM to empirical
comparison. Additionally, since $\mu_{\mathrm{empirical},i} - \mu_{\mathrm{LPM},i} > 0$, it is shown that the
simulated LPM networks have, on average, less Hamming distance from each of the empirical data
sets than the empirical data sets have from each other. This means that networks generated using
the LPM are closer to the original data than the observed empirical networks are to each other.
While the t-tests for 2005 Fort Leavenworth time periods 6, 8, and 9 are only marginally
significant, they have the same positive trend as the other 14 empirical networks in the 2005 and
2007 data sets.
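
The text does not state which adjustment for heteroscedasticity and unequal sample sizes was applied. One standard choice is Welch's t-test, sketched below as an assumption rather than as the procedure actually used; the function name `welch_t` and the toy samples are illustrative.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume equal variances or equal sample sizes.
    """
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2   # squared standard errors of each mean
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples: a few empirical-to-empirical distances (high variance)
# and several LPM-to-empirical distances (low variance).
dist_empirical = [10.0, 14.0, 18.0]
dist_lpm = [4.0, 5.0, 6.0, 5.0]
t_stat, df = welch_t(dist_empirical, dist_lpm)
print(round(t_stat, 3), round(df, 3))
```

This setting resembles the comparison above in one respect: there are only a handful of empirical-to-empirical distances per time period but many simulated distances, so the two samples differ sharply in both size and spread.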


Table 6. 2005 Fort Leavenworth Data Hamming Distances and T-test for LPM.

Time      μ_empirical,i (Empirical    Standard     μ_LPM,i (LPM         Standard     T_i         p-value
Period    Hamming Distance)           Deviation    Hamming Distance)    Deviation    (t-test)
1         1445.000                    84.774       1284.338             23.747       3.467       0.001
2         1394.750                    67.487       1239.647             23.703       3.765       0.000
3         1296.125                    85.436       1151.946             23.671       3.287       0.001
4         1315.875                    153.533      1169.665             23.718       2.421       0.015
5         1191.250                    112.324      1058.990             23.667       2.732       0.006
6         1204.875                    207.944      1071.116             23.623       1.912       0.056
7         1167.375                    190.431      1037.713             23.695       1.980       0.048
8         1159.625                    204.465      1030.815             23.732       1.888       0.059
9         1170.125                    195.266      1040.142             23.618       1.953       0.051

Table 7. 2007 Fort Leavenworth Data Hamming Distances and T-test for LPM.

Time      μ_empirical,i (Empirical    Standard     μ_LPM,i (LPM         Standard     T_i         p-value
Period    Hamming Distance)           Deviation    Hamming Distance)    Deviation    (t-test)
1         409.286                     38.560       358.094              12.775       3.755       0.00
2         365.857                     18.298       320.097              12.739       7.073       0.00
3         365.857                     29.043       320.164              12.793       4.450       0.00
4         377.857                     38.247       330.674              12.773       3.489       0.00
5         375.286                     36.100       328.377              12.796       3.675       0.00
6         349.857                     38.159       306.078              12.785       3.245       0.00
7         373.857                     48.451       327.073              12.826       2.731       0.01
8         362.429                     55.635       317.151              12.775       2.302       0.02


Discussion


The LPM has been used to model longitudinal social network data for four different data
sets. In the two data sets for which a non-degenerate ERGM was available, the LPM generates
simulated networks that are more like the original data than networks generated using the ERGM.
In addition, it is generally the case that the networks generated using the LPM are more similar
to the original data than the networks observed in other time periods. The LPM avoids issues of
model degeneracy due to its formulation. The probability of link occurrence is based on the
historic presence of links and does not use a Markov assumption or overspecify a statistical
model. For these reasons, the LPM provides an alternative method
for modeling and conducting longitudinal social network analysis.

Monte Carlo simulations can be generated using the LPM. Each cell (i, j) in the LPM can
be compared to a uniform (0,1) random variable to determine the presence of a link in a
simulated adjacency matrix. As shown earlier, these simulated adjacency matrices are very
similar to the empirical data, as indicated by the low Hamming distance between simulated
networks and empirical networks. Statistical distributions can then be fit to any social
network measures calculated on the simulated networks, and these distributions can be used
for inference with traditional statistical methods.
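
The simulation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-node probability matrix is invented for the example, links are treated as directed and drawn independently cell by cell, and the names simulate_network, hamming_distance, and density are hypothetical.

```python
import random


def simulate_network(lpm, rng):
    """Draw one adjacency matrix: a link is present when a uniform (0,1)
    draw falls below the corresponding link probability."""
    n = len(lpm)
    return [[1 if rng.random() < lpm[i][j] else 0 for j in range(n)]
            for i in range(n)]


def hamming_distance(a, b):
    """Count the cells in which two equally sized adjacency matrices differ."""
    return sum(x != y for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def density(adj):
    """Share of off-diagonal cells that contain a link (directed network)."""
    n = len(adj)
    links = sum(adj[i][j] for i in range(n) for j in range(n) if i != j)
    return links / (n * (n - 1))


# Hypothetical link probabilities for a 3-node network (assumption, not paper data).
lpm = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.5],
       [0.1, 0.5, 0.0]]

rng = random.Random(42)  # fixed seed so the run is repeatable
sims = [simulate_network(lpm, rng) for _ in range(1000)]

# Any network measure can be computed on each simulated network; its
# empirical distribution over the simulations can then be used for inference.
densities = [density(s) for s in sims]
mean_density = sum(densities) / len(densities)
```

Density is used here only because it is easy to compute; any other network measure could be substituted in the same loop, and hamming_distance can be applied between a simulated matrix and an observed one.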

The LPM cannot be used in place of the ERGM in all situations, however. Multiple
networks are required to estimate the LPM for a given empirical data set. The ERGM, on the
other hand, can be estimated from a single observed network. The approach to adding and
removing nodes also differs between the ERGM and the LPM. For the LPM, a missing node would
be included in the model with a 0 recorded for all column and row entries of the missing node.
Finally, the LPM is formulated on the assumption that there are fixed probability
structures underlying social networks that do not change significantly over time. The observed
social networks based on the LPM will fluctuate between time periods, but the general patterns
of connections remain the same. Table 8 illustrates some differences and similarities between
LPM and ERGM data requirements.


Table 8. Comparison of LPM and ERGM.

Data characteristics         LPM                   ERGM

Link weighting               Dichotomous           Dichotomous

Number of links              No limit              Probability of degeneracy
                                                   increases with number of links

Min. no. of time periods     2                     1

Practical no. of time        5+                    1
periods

Assumed cause of             Dynamic equilibrium   Evolves due to structural
stochasticity                                      properties of the network.
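
The data requirements above can be made concrete with a short Python sketch of one way to estimate link probabilities as historical frequencies. It is a sketch under stated assumptions rather than the authors' code: the three small networks are invented, and the function names estimate_lpm and embed_with_missing_nodes are hypothetical. A node that is absent in a period is given zero entries in its row and column, as described in the text.

```python
def embed_with_missing_nodes(adj, present, size):
    """Place an adjacency matrix observed on the nodes listed in `present`
    into a size x size matrix, leaving zeros for every missing node."""
    full = [[0] * size for _ in range(size)]
    for a, i in enumerate(present):
        for b, j in enumerate(present):
            full[i][j] = adj[a][b]
    return full


def estimate_lpm(networks):
    """Estimate each link probability as the fraction of observed
    networks in which that link was present."""
    if len(networks) < 2:
        raise ValueError("at least two observed networks are required")
    t = len(networks)
    n = len(networks[0])
    return [[sum(net[i][j] for net in networks) / t for j in range(n)]
            for i in range(n)]


# Hypothetical observations over three time periods (assumption, not paper data).
period_1 = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
period_2 = [[0, 1, 0],
            [1, 0, 0],
            [0, 0, 0]]
# Node 2 was not observed in period 3, so only nodes 0 and 1 are recorded.
period_3 = embed_with_missing_nodes([[0, 1],
                                     [1, 0]], present=[0, 1], size=3)

lpm = estimate_lpm([period_1, period_2, period_3])
```

Each estimate is a simple average over the observed periods, so adding another period only requires one more pass over the cells.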

The LPM has several advantages over the ERGM for longitudinal social network
analysis; however, the ERGM has advantages over the LPM for other types of analysis. Table 9
displays advantages and disadvantages of the LPM and ERGM models.


Table 9. Advantages and Disadvantages of LPM and ERGM.

Considerations          LPM                               ERGM

Required no. of         Disadvantage: The LPM requires    Advantage: The ERGM requires
observed networks       multiple observed networks to     only a single network.
                        estimate the link probability
                        of a network based on historic
                        frequency of occurrence.

Computational           Advantage: Computational cost     Disadvantage: Computational
efficiency              is quadratic in the number of     cost is n”, which requires
                        nodes in the network.             heuristic approximations of
                                                          model parameters.

Model quality           Advantage: Stable and             Disadvantage: Prone to
                        consistent model estimates.       degenerate models.

Accuracy to real data   Advantage: Shown to more          Disadvantage: Has not been
                        closely resemble empirical        shown to consistently model
                        data as measured by Hamming       empirical data accurately as
                        distance.                         measured by Hamming distance.

Explanation of social   Disadvantage: Does not attempt    Advantage: Model terms can be
dynamics                to explain underlying social      interpreted as underlying
                        dynamics of the group or          mechanisms for social dynamics
                        organization.                     within the modeled group or
                                                          organization.


The LPM requires multiple observed networks to estimate model parameters, whereas the ERGM
can be estimated using a single observed network. At a minimum, two observed networks are
required to estimate an LPM; in practice, however, the standard error of the estimate is
proportional to 1/√n, where n is the number of observed networks. We nominate five observed
networks as a rule of thumb for fitting the LPM, as most of the estimate variance is eliminated
with this number. The LPM is more computationally efficient than the ERGM. The number of link
probabilities for a network is quadratic in the number of nodes, and the LPM estimates are then
linear in the number of observed networks. The ERGM parameter estimates can be n” in the
number of nodes for each term, and heuristics are often used to estimate ERGM model
parameters. In addition, the ERGM has problems with model degeneracy, as previously discussed.
The LPM has been shown to provide a model that can be used to simulate data that is more
similar to empirical data than data generated with ERGM simulations. An additional benefit of
the LPM is the ability to use link probabilities as dependent variables in regression models
for homophily. Homophily describes the similarity between individuals in terms of certain
attributes that the individuals have. In more complex models, the parameters of link
probability densities can serve as dependent variables in homophily regression. Unfortunately,
the LPM does not provide any explanation of likely structural causes for the stochastic
behavior of networks. Significant terms in an ERGM can be interpreted as the underlying
mechanism for network evolution over time. It may be possible to develop similar explanations
of behavior through future research in homophily regression using the LPM. Further research is
needed on both the ERGM and the LPM to illuminate strengths and limitations. In the interim,
there is strong evidence to suggest the use of the LPM whenever degeneracy is a problem among
ERGMs, or when the goal is to estimate the normal behavior of a social group that is in
dynamic equilibrium.
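
The five-network rule of thumb can be illustrated with the textbook standard error of an estimated proportion, sqrt(p(1 - p)/n). Treating each link probability estimate as a simple proportion over n independent observed networks is an assumption made only for this sketch, and the function name proportion_standard_error and the value p = 0.5 are hypothetical choices.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion estimated from n independent
    observations (textbook formula, assumed to apply to one link)."""
    return math.sqrt(p * (1.0 - p) / n)


p = 0.5  # assumed link probability; this value gives the largest standard error
baseline = proportion_standard_error(p, 1)

for n in range(1, 11):
    se = proportion_standard_error(p, n)
    # The ratio to the single-network case depends only on n: 1 / sqrt(n).
    print(f"n={n:2d}  standard error={se:.3f}  relative to n=1: {se / baseline:.2f}")
```

Under this assumption the variance falls as 1/n, so five networks leave one fifth of the single-network variance, and each further network brings a smaller reduction.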

Another important area for future research is network periodicity. Intuitively, social
networks are subject to periodic trends. An average person's communication patterns may be
different during the week, while they are at work, than during the weekend, when they are at
home with their family. Future research will hopefully expand both the LPM and ERGM to
handle periodic trends in longitudinal data. It will be interesting to compare the performance of
the LPM and ERGM for modeling time-dependent longitudinal social network data sets.

This paper has introduced the Link Probability Model (LPM) for longitudinal social network
analysis. The primary strength of the LPM is its ability to accurately model longitudinal network
behavior with better goodness of fit than competing models. The LPM also avoids issues of
model degeneracy due to the method of its construction. Finally, the LPM is more
computationally efficient than the ERGM for both estimation and simulation. Using the LPM,
accurate simulation of longitudinal social network data can be performed. This opens the door
for researchers to explore an entirely new approach for inference on social networks.